Restricting datasets to classifiable samples augments discovery of immune disease biomarkers

Immunological diseases are typically heterogeneous in clinical presentation, severity and response to therapy. Biomarkers of immune diseases often reflect this variability, especially compared to their regulated behaviour in health. This leads to a common difficulty that frustrates biomarker discovery and interpretation – namely, unequal dispersion of immune disease biomarker expression between patient classes necessarily limits a biomarker’s informative range. To solve this problem, we introduce dataset restriction, a procedure that splits datasets into classifiable and unclassifiable samples. Applied to synthetic flow cytometry data, restriction identifies biomarkers that are otherwise disregarded. In advanced melanoma, restriction finds biomarkers of immune-related adverse event risk after immunotherapy and enables us to build multivariate models that accurately predict immunotherapy-related hepatitis. Hence, dataset restriction augments discovery of immune disease biomarkers, increases predictive certainty for classifiable samples and improves multivariate models incorporating biomarkers with a limited informative range. This principle can be directly extended to any classification task.


Density Density Density
Supplementary Figure 1 Class sizes do not determine the informative value of disease biomarkers.
In this supplementary note, we explain the relationship between the discriminatory value of disease biomarkers, class sizes and correct classification rates.Most readers intuitively appreciate that a test's accuracy (also called the correct classification rate) depends upon the relative number of cases in each class.However, a marker's discriminatory value is independent of class size.Figure 1 explains this critical distinction through four illustrated examples, arranged in columns (1 -4) each consisting of 5 panels (a -e).
The vertically arranged panels depict the same datasets treated in five different ways: (a) Boxplot of the biomarker values for both diseased (red) and non-diseased (green) samples.250 randomly selected samples represented as individual points.The boxplot provides a visual summary of the distribution of the biomarker values.An arbitrary discriminatory cutoff is displayed at a biomarker value of 5.8.

(b)
Histogram of biomarker values for diseased (red) and non-diseased (green) samples.The histogram illustrates the distribution of cases with respect to biomarker values in each class.The same arbitrary discriminatory cutoff is displayed at a biomarker value of 5.8.
(c) Confusion matrix summarizing the predicted class according to the arbitrary biomarker value cutoff value of 5.8.This table shows the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FP).From these numbers, we can calculate the true positive rate (TPR), false positive rate (FPR) and accuracy.Hence, the confusion matrix helps to evaluate the performance of a biomarker in predicting disease status for one specific cutoff value.
(d) Receiver operating characteristic (ROC) curves are graphical representations of the performance of a biomarker in distinguishing between diseased and non-diseased samples.ROC curves plot the true positive rate (sensitivity) against the false positive rate (1-specificity) calculated for all biomarker values in the dataset.As described in Figure 1, the area under the ROC (AUC) is a measure of a marker's discriminatory power.
(e) Densities for diseased (red) and non-diseased samples.Imagine a histogram divided by the total number of samples in the respective class.
Let us first consider columns 1 and 2, which represent a biomarker with no discriminatory value: (a) Regardless of the number of diseased (red: column 1, n=6000; column 2, n=8000) or non-diseased (green: column 1, n=4000; column 2, n=2000) samples in each class, the boxplots shown in (a1) and (a2) are almost indistinguishable.

(b)
In contrast, the histograms in (b1) and (b2) clearly show more diseased samples in column 2.
(c) By definition, no biomarker value is useful for discriminating between the two classes.Our arbitrary choice of 5.8 as a cutoff leads to confusion tables (c1) and (c2).The accuracy of this test decreases from (c1; 45%) to (c2; 32%) because more samples happen to fall below the cutoff.Importantly, TPR (21%) and FPR (21%) are identical irrespective of class size.

(d)
The ROC curves in (d1) and (d2) plot TPR against FPR for these samples.Because TPR and FPR are not affected by class size, the ROC curves and their area under the curve are essentially identical (d1: AUC = 0.50; d2: AUC = 0.51).Importantly, our interpretation of these curves as showing a biomarker with no discriminatory value is not affected by the different class sizes in (d1) and (d2).

(e)
The density plots shown in (e1) and (e2) help us to understand why our hypothetical biomarker has no discriminatory power.By normalizing the class sizes, we see the biomarker distributions in the diseased (red) and non-diseased (green) classes are essentially identically shaped and completely overlapping.ROC curves are based on densities, not absolute numbers of cases.
Let us now consider columns 3 and 4, which represent a biomarker with a high discriminatory value: (a) Regardless of the number of diseased (red: column 3, n=6000; column 4, n=8000) or non-diseased (green: column 3, n=4000; column 4, n=2000) samples in each class, the boxplots shown in (a1) and (a2) are almost indistinguishable.
(b) In contrast, the histograms in (b3) and (b4) clearly show more diseased samples in column 4.
(c) Our arbitrary choice of 5.8 as a cutoff leads to confusion tables (c3) and (c4).The accuracy of this test increases slightly from (c3; 85%) to (c4; 87%) owing to the relative differences in class sizes.Importantly, TPR (89%) and FPR (21%) are identical irrespective of class size.

(d)
The ROC curves in (d3) and (d4) plot TPR against FPR for these samples.Because TPR and FPR are not affected by class size, the ROC curves are essentially identical (d3: AUC = 0.92; d4: AUC = 0.92).Importantly, our interpretation of these curves as showing a biomarker with high discriminatory value is not affected by the different class sizes in (d3) and (d4).
(e) The density plots shown in (e1) and (e2) help us to understand why our hypothetical biomarker has high discriminatory power.By normalizing the class sizes, we see the biomarker distributions in the diseased (red) and non-diseased (green) classes are essentially identically shaped but only have a small overlap.
In summary, class sizes may affect the accuracy of a test, but not its TPR or FPR.Hence, the discriminatory value of a marker, which is a function of TPR and FPR, is not determined by class sizes.Densities are a class size-independent way of visualising the expression of a disease biomarker within a patient class.Because densities are directly relevant to a marker's discriminatory value, we present plots of densities throughout this article.[Gated TCRγδ + CD8 + ]  2% of all positive samples are different N (9, 1) from the majority population N (6, 1).The optimal restriction is indicated with a red line (value = 6.8) and marker HIGH samples are kept which is identical to (a).The top-right panel shows the complete ROC curve with its restriction at FPR = 0.366.The bottom-left panel relates FPR to biomarker values for all samples.The bottom-right panel shows the restricted standardized AUC (rzAUC) for every possible restriction calculated for marker HIGH and marker LOW samples.The optimal restriction is indicated by a red line.Supplementary Figure 5 Schematic gating scheme for one real and three synthetic samples.
In this paper, our gating schema for the DURAClone IM T Cell Subsets Tube gates 1) CD3 + cells, 2) splits them into quadrants according to CD4 and CD8, 3) subsequently splits CD4 + /CD8 − and CD4 − /CD8 + into quadrants according to CD45RA and CCR7, 4) subsequently splits every quadrant into quadrants according to CD27 and CD28, 5) subsequently splits every quadrant into CD57 − , CD57 + /PD1 + and CD57 + /PD1 − .The first row shows this gating for sample D142.The next three rows show the gating for three simulated samples where CD8 + T EMRA were increased to have a mean of 33.23% in contrast to a baseline mean of 7.17% in 8 donors with 6 replicates each.Supplementary Figure 6 The restricted standardized AUC is biased.
Although the AUC of the complete ROC curve is unbiased, the optimized rAUC and rzAUC are biased metrics; therefore, to compare the discriminatory performance of different biomarkers, we introduce permutation p-values.(a) A simulated distribution of 2500 positive and 2500 negative samples, where 20% of all positive and 2% of all negative samples are different N (9, 1) from the majority population N (6, 1).The optimal restriction (value = 6.Supplementary Figure 7 Restriction applied to randomly labelled data We used data from our training set of 110 patients with advanced melanoma.We selected the proportion of

+
EM cells as the biomarker of interest.We then randomly permuted the class labels of the samples and applied restriction to the permuted data.We repeated this procedure 10000 times to generate restricted and unrestricted permutation p-values.Correction for multiple testing with the false discovery rate led to a q-value of 1 for all biomarkers.(a) Permutation p-values with (y-axis) and without (x-axis) restriction.Small unrestricted p-values tend to have smaller restricted p-values.This pattern arises from the fact that our method will result in no restriction being applied to the dataset if that yields better performance.Consequently, small unrestricted p-values also exhibit small restricted p-values.(b) Permutation q-values with (y-axis) and without (x-axis) restriction after p-value correction for multiple testing with the Benjamini-Hochberg procedure.Most of the q-values are 1, which is expected since the null hypothesis is true.(c) Permutation p-value distribution without restriction.We observe that the distribution is uniform as expected.There are 1033 biomarkers with an uncorrected permutation p-value below 0.1.(d) Permutation p-value distribution with restriction.We observe that the distribution is also uniform as expected.There are 987 biomarkers with an uncorrected permutation p-value below 0.1.9 a • Proteomic data from Harel et al. 2019.• FFPE samples from 74 patients treated with anti-PD1 immunotherapy.
• Objective: To find differentially expressed proteins between responders and non-responders defined by RECIST v1.0 guidelines.
• Restriction finds 2 otherwise missed significant proteins after correction for multiple testing at a q-value significance level of 0.  stable diseases according to RECIST v1.0 guidelines.We used the data available under Table S1 of the publication, sheets S1C and S1D.The authors were interested in significantly expressed proteins that discriminated between responders and non-responders: see Figure 3C    • FFPE samples from 42 patients treated with TILbased adoptive cell transfer.
• Objective: To find differentially expressed proteins between responders and non-responders defined by RECIST v1.0 guidelines.
• Restriction finds 2 otherwise missed significant proteins after correction for multiple testing at a q-value significance level of 0.  Permutation q−value unrestricted AUC Permutation q−value max restricted AUC

d e
Supplementary Figure 10 Restriction applied to proteomics -2 • A merged dataset of 312 faecal samples was analyzed, which comprised 165 new samples from 5 separate cohorts, plus 147 samples from four publicly available datasets.
• Across all samples, 668 species in 228 genera, 82 families, 42 orders, 26 classes and 14 phyla were identified.Here, we used all 1060 taxonomic identifiers as features for our reanalysis.
• Here, we used progression free survival (PFS12) as clinical endpoint for our reanalysis.
• Restriction finds 1 otherwise missed species and its parent genus after correction for multiple testing at a q-value significance level of 0.   • 18 pre-treatment peripheral blood samples from patients with metastatic melanoma treated with ICIbased immunotherapy.
• CyTOF data were analyzed with FlowSOM and 20 resulting subpopulations of major mononuclear lineages were annotated.
• Outcome was defined as severe immune related adverse event after treatment initiation.
• Restriction finds 2 otherwise missed subpopulations after correction for multiple testing at a q-value significance level of 0.  • A total of 15,896 genes were quantified.
• Outcomes were defined as objective response rate (ORR) according to RECIST v1.1 or irRECIST criteria.
• Here we report univariate restriction applied to 15896 measured genes.Restriction finds 19 otherwise missed genes after correction for multiple testing at a q-value significance level of 0.1.We show the performances of    where ( ) and ( ) are the probability density functions of and , respectively.Their derivatives are the negative probability density functions: and following Equation ( 6), [ ≥ −1 ( )] = . (17 We define as the joint probability density function of and for independent random variables and .We further describe the following probability in terms of the empirical survival functions: = ( ) − ( ), ∈ { , }. (21)

AUC for restricted sample space
Since we have (see Equation ( 77)) the area under the rROC curve is given by By definition, we have For rAUC we then obtain

Supplementary Figure 2
part 1/2.FlowSOM clustering results in TCR + cells as a biomarker for hepatitis.(a) PBMC from 64 patients were stimulated for 4 hours and stained for cell surface antigen and intracellular cytokines.FlowSOM analysis was performed on CD3 + cells with 9 metaclusters and 196 clusters.Metacluster 9 coloured blue, cluster 173 coloured pink (n=42 no hepatitis, n=22 hepatitis).(b) There was a higher abundance of cells in metacluster 9 in patients who subsequently developed hepatitis, tested with the Mann-Whitney U test.(c) Metacluster 9 corresponded to a population of TCR + cells.(d) The proportion of cells in cluster 173 was higher in hepatitis patients.We applied multiple Mann-Whitney U tests to compare the abundance of cells in each cluster between patients with and without hepatitis.P-values were adjusted for multiple testing using the Benjamini-Hochberg procedure.(e) Cluster 173 included mostly CD8 + TCR + cells.(f) TCR + cells are distributed in different areas of the viSNE plot; metacluster 9 was mainly formed by CD8 + and double negative (DN) T cells.

Supplementary Figure 3 2 Supplementary Figure 4
Restriction applied to count distributionsCount distributions are usually modelled as coming from a negative binomial distribution.The negative binomial distribution ( ( , )) is parametrized by its mean and the dispersion .Weshow four examples of two classes from differently parametrized negative binomial distributions.The distributions are shown after log 2 (simulated counts + 1) transformation.Note that our method would find the same informative range with or without log transformation because it is based on the ROC curve.Each subfigure presents log 2 transformed densities and their corresponding ROC curves.The method-selected restriction value is emphasized as a vertical red line.For both the positive and negative classes, 2500 samples were simulated.A log 2 -fold change of 2.5 was used as the difference between the two classes.We chose a mean count of 100 (log 2 (100 + 1) = 6.66) for the negative population and a mean count of 625 (log 2 (625 + 1) = 9.29) for the positive population.(a) Simulated example of positive ( (9.29, 1)) and negative ( (6.66, 1)) samples are shown.Restriction identifies the full range as informative.(b) Simulated example of positive ( (9.29, .25))and negative ( (6.66, .25))samples are shown.Restriction excludes samples with zero counts.(c) Simulated example of positive ( (9.29, .05))and negative ( (6.66, .05))samples are shown.Restriction identifies the informative range as samples with more than 2 8.403 ≈ 338 counts.(d) Simulated example of positive ( (9.29, .05))and negative ( (6.66, .07))samples are shown.Restriction identifies the informative range as samples with more than 2 4.67 ≈ 25 counts.Swapped positive and negative class restriction.Exchanging the labels of the positive and negative classes leads to an inversion of the ROC curve, but does not affect the optimal restriction.(a) The top-left panel shows the distribution of 2500 positive and 2500 negative samples where 20% of all positive and 2% of all negative samples are different N (9, 1) from the majority population N (6, 1).The optimal restriction is indicated with a red line (value = 6.8) and marker HIGH samples are kept.The top-right panel shows the complete ROC curve with the same restriction indicated at FPR = 0.258.The bottom-left panel relates FPR to biomarker values for all samples.The bottom-right panel shows the restricted standardized AUC (rzAUC) for every possible restriction calculated for marker HIGH and marker LOW samples.The optimal restriction is indicated by a red line.(b) This hypothetical example is identical to that shown in (a) except that the positive and negative class labels were switched when calculating the ROC curve.The top-left panel shows the distribution of 2500 negative and 2500 positive samples where 20% of all negative and

8 )
is indicated by a red line.(b) The corresponding complete ROC curve for all samples with the optimal restriction (FPR = 0.258) is indicated by a red line.(c) A plot of biomarker values against FPR calculated for all samples.The optimal restriction is indicated by red lines.(d) The restricted standardized AUC (rzAUC) calculated for marker HIGH (orange) and marker LOW (blue) samples calculated for all possible restrictions.The optimal restriction is indicated by a red line.(e) Plot relating the AUC of the complete ROC curve to the rAUC for 10000 permutations of the positive-and negative-class labels.The red point represents the observed value for the unpermuted (i.e.correctly labelled) data.(f) Plot relating the standardized AUC of the complete ROC curve (zAUC) to the rzAUC for 10000 permutations of the positive-and negative-class labels.The red point represents the observed value for the unpermuted data.
of their publication.(a) Summary.(b) Permutation p-values for the restricted (y-axis) and unrestricted (x-axis) AUCs within the anti-PD1 cohort.(c)

Supplementary Figure 11
Restriction applied to microbiomics Supplementary Figure 12 a • Mass cytometry data from Lozano et al. 2022.
value unrestricted AUC Permutation q−value max restricted AUC
* Phylum:Firmicutes Class:Clostridia Order:Eubacteriales Family:Lachnospiraceae Genus:unclassified Species:[Eubacterium] rectale.Now known as species Agathobacter rectalis, NCBI:txid39491.* Supplementary Figure 14 Zhang et al. (2022) built multiple machine learning models to predict objective response rates in immune checkpoint inhibitor-treated patients.They gathered single-cell data from 345 patients in 17 cancer types to find the "Stem.Sig" gene signature, a list of 454 differentially expressed genes between non-/malignant cells identified by CytoTRACE.In accordance with the article, we built multivariate random forests on the subset of 369 genes, identified in the 921 bulk RNA-seq measurements.Zhang et al. (2022) partitioned the dataset into 618 bulk RNA-seq training samples, 154 validation and 149 test samples.The training and validation samples were from five studies and randomly split into 80% training and 20% validation samples.The test samples were from five other studies.All bulk samples originated from five cancer types and were treated with one of four ICI treatments.The clinical outcome objective response rate (ORR) was defined according to RECIST v1.1 or irRECIST criteria in one study.Patients were classified as responders (complete response or partial response) or non-responders (stable disease or progressive disease).Default parameters were set as 1000 trees, a maximum depth of 20, a minimum number of leaf observations of 1, 20 bins, 21 random candidate biomarkers per split and a sample rate of 0.632.Parameter optimization was performed with a maximum of 500 models, random parameter selection of 50-1000 trees, 3-10 random candidate biomarkers per split, a depth of 5-30, 1-3 minimum number of leaf observations of 1-3, 20-50 bins and a sample rate of 0.55-0.85.The imputed value for samples outside the informative range was defined as -20, being outside the range of all values across all samples.
(a) Summary.(b-e) ROC curves of restricted (blue) and base (red) random forest models on validation (left) and test (right) samples with (bottom) and without (top) optimization of random forest hyperparameters.